A SIMPLE PROOF OF KAIJSER'S UNIQUE ERGODICITY
RESULT FOR HIDDEN MARKOV $\alpha$-CHAINS

By Fred Kochman and Jim Reeds 

Center for Communications Research 

According to a 1975 result of T. Kaijser, if some nonvanishing
product of hidden Markov model (HMM) stepping matrices is
subrectangular, and the underlying chain is aperiodic, the corresponding
$\alpha$-chain has a unique invariant limiting measure $\lambda$.
Here the $\alpha$-chain $\{\alpha_n\} = \{(\alpha_{ni})\}$ is given by

$$\alpha_{ni} = P(X_n = i \mid Y_n, Y_{n-1}, \ldots),$$

where $\{(X_n, Y_n)\}$ is a finite state HMM with unobserved Markov
chain component $\{X_n\}$ and observed output component $\{Y_n\}$. This
defines $\{\alpha_n\}$ as a stochastic process taking values in the
probability simplex. It is not hard to see that $\{\alpha_n\}$ is itself
a Markov chain. The stepping matrices $M(y) = (M(y)_{ij})$ give the
probability that $(X_n, Y_n) = (j, y)$, conditional on $X_{n-1} = i$.
A matrix is said to be subrectangular if the locations of its nonzero
entries form a Cartesian product of a set of row indices and a set of
column indices.

Kaijser's result is based on an application of the Furstenberg-Kesten
theory to the random matrix products $M(Y_1)M(Y_2)\cdots M(Y_n)$.
In this paper we prove a slightly stronger form of Kaijser's theorem
with a simpler argument, exploiting the theory of e-chains.

1. Introduction. In 1975 Kaijser [9] gave a simple sufficient condition
for the uniqueness of the invariant measure for the so-called
$\alpha$-chain, a certain weak Feller chain with compact state space
arising in the study of an arbitrary finite state hidden Markov model.
This provided an elegant partial answer to a question posed by David
Blackwell in 1957 [3]. (We follow Blackwell in using $\alpha_n$ to denote
the state of the $\alpha$-chain at time $n$. Kaijser calls it $Z_n$.)
The transition behavior of a finite state hidden Markov model (HMM) and
of its associated $\alpha$-chain can be specified by a finite collection
of substochastic matrices, the stepping matrices; and probability
calculations with HMMs involve, at least conceptually, lengthy matrix
products of stepping matrices. Accordingly, Kaijser's analysis utilized
the Furstenberg-Kesten theory [6] of random matrix products. However, by
exploiting the theory of e-chains, in particular Theorem 18.4.4 of [11],
we are able to give a simpler proof of a result slightly stronger than
that in [9].

Briefly, an HMM [2] consists of a pair of stochastic processes,
$\{X_n\}$ and $\{Y_n\}$, taking values in finite sets $\mathcal{X}$ and
$\mathcal{Y}$, such that $\{X_n\}$ is a Markov chain and each $Y_n$ is a
probabilistic function of $(X_{n-1}, X_n)$. In modeling applications
[5, 10], the "observable" marginal process $\{Y_n\}$ is "output" from
the "hidden" process $\{X_n\}$.

The transition structure of $(X_n, Y_n)$ can be specified by the
stepping matrices $M(y) = (M(y)_{ij})$, given by

$$M(y)_{ij} = P((X_{n+1}, Y_{n+1}) = (j, y) \mid X_n = i);$$

their sum $M = \sum_y M(y)$ is equal to the transition matrix of the
Markov chain $\{X_n\}$.
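As a small illustration of this decomposition, the following Python sketch builds stepping matrices for a toy two-state HMM with two outputs and checks that they sum to a stochastic matrix. The numerical entries and the names `stepping`, `mat_add` and `transition` are assumptions made for the example; they do not come from the paper.

```python
from fractions import Fraction as F

# Hypothetical stepping matrices M(y) for a two-state chain, outputs y in {0, 1}.
# stepping[y][i][j] stands for P((X_{n+1}, Y_{n+1}) = (j, y) | X_n = i).
stepping = {
    0: [[F(1, 2), F(1, 4)], [F(0), F(1, 3)]],
    1: [[F(0), F(1, 4)], [F(1, 3), F(1, 3)]],
}

def mat_add(a, b):
    """Entrywise sum of two matrices stored as lists of rows."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# M = M(0) + M(1) should be the transition matrix of the hidden chain {X_n}.
transition = mat_add(stepping[0], stepping[1])
row_sums = [sum(row) for row in transition]

print(transition)  # each row should be a probability vector
print(row_sums)
```

Exact rational arithmetic is used so that the row sums can be compared with 1 without rounding concerns.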

Let $\Delta$ be the finite-dimensional simplex of probability measures
on $\mathcal{X}$, and provisionally set $\alpha_0 \in \Delta$ to be the
marginal distribution of $X_0$ and, for $n > 0$, set
$\alpha_n \in \Delta$ to be the conditional distribution of $X_n$ given
$\{Y_t : 1 \le t \le n\}$. It can be shown [3, 9] that $\{\alpha_n\}$ is
a Markov chain with $\Delta$ as its (continuous) state space.

Blackwell [3] first studied the $\alpha$-chain and posed the question of
when its transition law has a unique invariant measure. His partial
answer is based on contractivity hypotheses far stronger than Kaijser's
hypotheses, or ours. For Blackwell, $\alpha_n$ is the conditional
distribution of $X_n$ given the infinite past
$\{Y_t : -\infty < t \le n\}$, where now $\{X_n\}$ is assumed to be
stationary, so $\{\alpha_n\}$ is stationary as well. This $\{\alpha_n\}$
is again a Markov chain, with the same transition law as the provisional
$\{\alpha_n\}$ defined above. (Blackwell's motivation for studying the
$\alpha$-chain is a formula expressing the entropy of $\{Y_n\}$ in terms
of the distribution of his version of $\alpha_n$.) By starting the
$\alpha$-chain in the infinite past, so it is in effect born stationary,
Blackwell avoids questions of convergence to a limiting distribution.
But by allowing the $X$ chain to start at finite time $t = 0$, with
arbitrary distribution, Kaijser opens the possibility that the
$\alpha$-chain could be nonstationary, which, in turn, raises the
additional question of whether the finite-time distributions of
$\alpha_n$ converge to a stationary limit measure. These two versions of
the $\alpha$-chain have the same transition mechanism, but usually
different initial or marginal distributions on $\Delta$.

For our result, we allow $\{\alpha_n\}$ to have any initial distribution
on $\Delta$ at time $t = 0$, but with the same transition law as above.
Of course, this destroys the original motivating interpretation of
$\alpha_n$ as the conditional distribution of $X_n$; but it is the
(unchanged) transition mechanism whose properties are of primary
interest to us, rather than any particular realization of the chain.

We now prepare to state Kaijser's result. Following Kaijser, call a
nonnegative matrix $(D_{ij})$ subrectangular if the set of subscript
pairs with $D_{ij} > 0$ forms a Cartesian product, that is, if there
exist sets $R$ and $C$ of row and column subscripts so that $D_{ij} > 0$
if and only if $(i, j) \in R \times C$. Call the matrix $M$ (and the
chain $\{X_n\}$) irreducible if for every pair of subscripts $(i, j)$,
there is some $k$ [whose value may depend on $(i, j)$] for which
$(M^k)_{ij} > 0$. If there is a single value of $k$ such that for all
$(i, j)$ we have $(M^k)_{ij} > 0$, the matrix $M$ (and the chain
$\{X_n\}$) are said to be aperiodic.
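The three definitions above can be checked mechanically on small matrices. The sketch below is only an illustration: the function names are hypothetical, matrices are lists of rows, and the aperiodicity test relies on Wielandt's bound (a nonnegative $s \times s$ matrix with some strictly positive power already has $M^k > 0$ for $k = (s-1)^2 + 1$), which is a standard fact not stated in the paper.

```python
def is_subrectangular(d):
    """True if the positive entries of d form R x C for index sets R and C."""
    rows = {i for i, row in enumerate(d) if any(x > 0 for x in row)}
    cols = {j for row in d for j, x in enumerate(row) if x > 0}
    return all((x > 0) == (i in rows and j in cols)
               for i, row in enumerate(d) for j, x in enumerate(row))

def pattern(m):
    """Boolean zero pattern of a nonnegative matrix."""
    return [[x > 0 for x in row] for row in m]

def bool_mul(a, b):
    """Zero pattern of a product, computed from the patterns of the factors."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_irreducible(m):
    """For every (i, j), some power M^k with 1 <= k <= s has a positive (i, j) entry."""
    p = pattern(m)
    power, reach = p, p
    for _ in range(len(m) - 1):
        power = bool_mul(power, p)
        reach = [[r or q for r, q in zip(rr, pr)] for rr, pr in zip(reach, power)]
    return all(all(row) for row in reach)

def is_aperiodic(m):
    """A single power M^k is entrywise positive (checked at Wielandt's exponent)."""
    p = pattern(m)
    power = p
    for _ in range((len(m) - 1) ** 2):  # ends with the pattern of M^((s-1)^2 + 1)
        power = bool_mul(power, p)
    return all(all(row) for row in power)

swap = [[0, 1], [1, 0]]
print(is_irreducible(swap), is_aperiodic(swap))
```

Working with zero patterns rather than numerical powers avoids underflow and makes the checks exact.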

With this terminology, the result is (slightly paraphrased): 

Theorem A ([9]). Suppose the transition matrix $M$ is aperiodic.
Suppose some nonzero product of stepping matrices is subrectangular.
Then the probability distribution of $\alpha_n$ converges weakly to a
unique limit measure, independent of the initial distribution for
$\alpha_0$.

Kaijser's argument is along the following lines. A path for the chain
starting from $\alpha_0$ can be written as

$$\alpha_0,\quad \frac{\alpha_0 M(Y_1)}{\alpha_0 M(Y_1)e},\quad
\frac{\alpha_0 M(Y_1)M(Y_2)}{\alpha_0 M(Y_1)M(Y_2)e},\quad \ldots,\quad
\frac{\alpha_0 M(Y_1)M(Y_2)\cdots M(Y_k)}{\alpha_0 M(Y_1)M(Y_2)\cdots M(Y_k)e},\quad \ldots,$$

where $\{Y_n\}$ is the marginal process defined above and $e$ is the
column vector of all 1's. Since the sequence $\{M(Y_n)\}$ of matrices is
itself a stochastic process, Kaijser was able to cleverly adapt methods
of the Furstenberg-Kesten theory to the present subject.

However, we have a different line of argument, based on the theory of
e-chains, which we think is ultimately easier to understand. We replace
the subrectangularity condition with the following, which we will show
to be weaker. Let $\mathcal{M}$ be the set of stepping matrices, let
$\mathcal{M}^*$ be the set of all finite products of elements of
$\mathcal{M}$, and let $\mathcal{C} = \mathbb{R}_+\mathcal{M}^*$ be the
cone on $\mathcal{M}^*$, that is, all positive scalar multiples of
elements of $\mathcal{M}^*$. Our condition is that the closure,
$\bar{\mathcal{C}}$, should contain a matrix of rank 1.

A very brief sketch of our argument is as follows. 

First, in Theorem 1 we show our key technical result, that for an
arbitrary transition matrix $M$ and arbitrary decomposition into
stepping matrices, $\{\alpha_n\}$ is an e-chain, in the sense of [11],
page 144. Then we exploit the associated limit theory. Namely, given our
rank 1 hypothesis, when the matrix $M$ is irreducible, we show that the
state space $\Delta$ possesses a topologically reachable point $v$, in
the sense of [11], page 455. Further, if $M$ is also aperiodic, then $v$
must be topologically aperiodic in the sense of [11], page 459, as well.
Since $\Delta$ is compact, Theorem 18.4.4 of [11], page 460, immediately
applies, yielding the proof of our main result:
Theorem 2. Let the matrix $M$ be irreducible. Suppose there exists a
rank 1 element of $\bar{\mathcal{C}}$. Then $T_{\mathcal{M}}$ has a
unique stationary distribution $\lambda$, and
$(T_{\mathcal{M}}^{*})^n\mu \to \lambda$ weakly in Cesàro mean, for each
probability measure $\mu$ on $\Delta$. If, in addition, $M$ is
aperiodic, then also $(T_{\mathcal{M}}^{*})^n\mu \to \lambda$ weakly,
for each probability measure $\mu$ on $\Delta$.

[Here $T_{\mathcal{M}}$ denotes the Markov operator on $C(\Delta)$
associated with the $\alpha$-chain, and $T_{\mathcal{M}}^{*}$ its
adjoint acting on measures.]

Finally, we derive Kaijser's Theorem A from our Theorem 2, by showing
that if his subrectangularity hypothesis is true, so is our rank 1
hypothesis.

We conclude the paper with three calculations. The first shows that
aperiodicity of $M$, by itself, does not imply uniqueness of the
stationary measure for the $\alpha$-chain. Another shows that in
Kaijser's theorem the condition of aperiodicity cannot be replaced with
that of irreducibility. The third shows that an example of Kaijser's,
while not satisfying the conditions of his result (Theorem A), does
satisfy the conditions of ours (Theorem 2).

We now address the relation of this work to "random systems with
complete connections" ([4] and Chapter 2 of [8]) and "place dependent
random iterated function systems" (IFS) [1]. The $\alpha$-chain seems,
ignoring technicalities, to fall under the scope of these theories, so
one might suppose Kaijser's theorem followed as a corollary of standard
IFS results. We have, however, been unable to derive Kaijser's results
this way. Our main obstacle is that the state update functions for
$\alpha$-chains are only defined, in general, on open dense subsets of
$\Delta$ and need not extend continuously to all of $\Delta$; nor do
they seem to satisfy the conventional contractivity or mean
contractivity hypotheses imposed in the IFS literature. Since we
ultimately rely on the classical Perron theorem for aperiodic matrices,
we too are exploiting a kind of contractivity; the difference seems to
be that it enters at a later stage of the argument.

In common with Kaijser's argument, ours does exploit the special role
played by matrix products in $\alpha$-chain calculations. We find it
striking how smoothly the theory of e-chains may be applied without
clutter to the matrix product formulation, once the necessary groundwork
is completed. There does not seem to be any easy analogue of this matrix
product structure in the generic IFS example.

2. Notation, formulae. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite
sets, with $s = |\mathcal{X}|$, and let $\Delta \subset \mathbb{R}^s$ be
the simplex of probability measures on $\mathcal{X}$. Let $e$ be the
$s$-long column vector of all 1's. Let $C(\Delta)$ be the space of all
continuous real-valued functions on $\Delta$, with the sup-norm
topology. Let $\mathcal{P}(\Delta)$ denote the probability measures on
$\Delta$, equipped with the weak topology.
Let $\mathcal{M} = \{M(y) : y \in \mathcal{Y}\}$ be a family of
nonnegative matrices whose sum, $M = \sum_y M(y)$, is a Markov
transition matrix; so the stepping matrices $M(y)$ are substochastic.
Let $\mathcal{M}^*$ be the set of finite products of elements of
$\mathcal{M}$. As a convenience, we will use notations like
$\mathbf{y} = y_1, y_2, \ldots, y_n \in \mathcal{Y}^n$ to denote finite
sequences of elements of $\mathcal{Y}$. If
$\mathbf{y} \in \mathcal{Y}^n$, let $|\mathbf{y}| = n$ denote its
length. If $\mathbf{y} = y_1, y_2, \ldots, y_n \in \mathcal{Y}^n$, we
will let $M(\mathbf{y})$ be shorthand for the matrix product
$M(y_1)M(y_2)\cdots M(y_n)$. We use the term word to refer
(indiscriminately) to tuples $\mathbf{y}$ or matrix products
$M(\mathbf{y})$.

For each $\mu \in \mathcal{P}(\Delta)$, the $\alpha$-chain may be
concisely defined as follows: Pick $\alpha_0$ according to $\mu$;
conditional on $\alpha_0$, pick $\{Y_n\}$ so that
$P(Y_1 = y_1, \ldots, Y_n = y_n) = \alpha_0 M(y_1)\cdots M(y_n)e$, and
then conditional on $\alpha_0$ and $Y_1, \ldots, Y_n$, define
$\{\alpha_n\}$ to satisfy the conditionally certain recursion

$$\alpha_n = \frac{\alpha_{n-1}M(y_n)}{\alpha_{n-1}M(y_n)e},$$

which is to say,

$$\alpha_n = \frac{\alpha_0 M(y_1)\cdots M(y_n)}{\alpha_0 M(y_1)\cdots M(y_n)e}.$$
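The recursion can be sketched in a few lines of Python. This is an illustration under assumptions: the stepping matrices `STEPPING` are toy values, and `vec_mat`, `step` and `sample_path` are hypothetical helper names. At each step an output $y$ is drawn with probability $\alpha M(y)e$ and the state is renormalized.

```python
import random
from fractions import Fraction as F

# Toy stepping matrices (an assumption for illustration, not from the paper).
STEPPING = {
    0: [[F(1, 2), F(1, 4)], [F(0), F(1, 3)]],
    1: [[F(0), F(1, 4)], [F(1, 3), F(1, 3)]],
}

def vec_mat(a, m):
    """Row vector times matrix."""
    return [sum(a[i] * m[i][j] for i in range(len(a))) for j in range(len(m[0]))]

def step(alpha, m_y):
    """Return alpha M(y) / (alpha M(y) e) together with the weight alpha M(y) e."""
    unnorm = vec_mat(alpha, m_y)
    weight = sum(unnorm)
    return [x / weight for x in unnorm], weight

def sample_path(alpha0, stepping, n, rng):
    """Simulate alpha_0, ..., alpha_n, drawing y with probability alpha M(y) e."""
    alpha, path = alpha0, [alpha0]
    outputs = list(stepping)
    for _ in range(n):
        weights = [float(sum(vec_mat(alpha, stepping[y]))) for y in outputs]
        y = rng.choices(outputs, weights=weights)[0]
        alpha, _ = step(alpha, stepping[y])
        path.append(alpha)
    return path

path = sample_path([F(1, 2), F(1, 2)], STEPPING, 5, random.Random(0))
print(path[-1])
```

Two applications of `step` agree with normalizing the two-factor product directly, which is the telescoped form of the recursion.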

Though $\{Y_n\}$ is not generally Markov of any finite order, it is
known [3, 9] that $\{\alpha_n\}$ is a Markov chain on $\Delta$ whose
transition law is given by the transition kernel

$$P(\alpha, A) = \sum_y \alpha M(y)e,$$

where the sum extends over all $y \in \mathcal{Y}$ such that
$\alpha M(y)/\alpha M(y)e \in A$.
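The chain is weak Feller: its Markov operator

$$T_{\mathcal{M}} : C(\Delta) \to C(\Delta)$$

is given by the formula

$$(T_{\mathcal{M}} f)(\alpha) = \sum_y (\alpha M(y)e)\,
f\!\left(\frac{\alpha M(y)}{\alpha M(y)e}\right),$$

where any term with $\alpha M(y)e = 0$ is set equal to 0. For later use,
we record the telescoped $n$-step transition formulas

$$P^n(\alpha, A) = \sum_{\mathbf{y}} \alpha M(\mathbf{y})e, \tag{1}$$

where the sum extends over all $\mathbf{y} \in \mathcal{Y}^n$ such that
$\alpha M(\mathbf{y})/\alpha M(\mathbf{y})e \in A$, and

$$(T_{\mathcal{M}}^n f)(\alpha) = \sum_{\mathbf{y} \in \mathcal{Y}^n}
(\alpha M(\mathbf{y})e)\,
f\!\left(\frac{\alpha M(\mathbf{y})}{\alpha M(\mathbf{y})e}\right). \tag{2}$$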
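Formula (2) can be evaluated by brute force for small $n$ by enumerating all words of length $n$. The sketch below does this with exact rationals; the stepping matrices are toy values and the names `markov_operator_n` and `vec_mat` are hypothetical. It is meant only to make the telescoping concrete, since the cost grows like $|\mathcal{Y}|^n$.

```python
from fractions import Fraction as F
from itertools import product

# Toy stepping matrices (an assumption for illustration, not from the paper).
STEPPING = {
    0: [[F(1, 2), F(1, 4)], [F(0), F(1, 3)]],
    1: [[F(0), F(1, 4)], [F(1, 3), F(1, 3)]],
}

def vec_mat(a, m):
    """Row vector times matrix."""
    return [sum(a[i] * m[i][j] for i in range(len(a))) for j in range(len(m[0]))]

def markov_operator_n(f, alpha, stepping, n):
    """(T^n f)(alpha): sum over words y of length n of (alpha M(y) e) f(alpha M(y) / alpha M(y) e)."""
    total = 0
    for word in product(stepping, repeat=n):
        unnorm = alpha
        for y in word:
            unnorm = vec_mat(unnorm, stepping[y])
        weight = sum(unnorm)
        if weight > 0:  # terms with alpha M(y) e = 0 are set equal to 0
            total += weight * f([x / weight for x in unnorm])
    return total

alpha = [F(1, 3), F(2, 3)]
f = lambda a: a[0] ** 2
two_step = markov_operator_n(f, alpha, STEPPING, 2)
nested = markov_operator_n(
    lambda b: markov_operator_n(f, b, STEPPING, 1), alpha, STEPPING, 1)
print(two_step, nested)
```

Applying the one-step operator twice gives the same value as the two-step formula, and constants are preserved because the word weights sum to $\alpha M^n e = 1$.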
Since $\Delta$ is compact, it is immediate that at least one
$T_{\mathcal{M}}$-invariant probability law exists; part of what is at
issue is whether there is more than one. If there is a unique
$T_{\mathcal{M}}$-invariant probability law, we say the $\alpha$-chain
is uniquely ergodic.

3. The $\alpha$-chain is an e-chain. A weak Feller chain with compact
state space $S$ is an e-chain if its operator $T$ on $C(S)$ is such
that, for each $f \in C(S)$, the set of functions
$\{T^n f : n \ge 0\}$ is equicontinuous. We show that every
$\alpha$-chain is an e-chain:

Theorem 1. Let $M$ be any transition matrix with any stepping
decomposition $\mathcal{M}$. Then the corresponding $\alpha$-chain is an
e-chain.

Proof. By the Arzelà-Ascoli theorem, since $\Delta$ is compact, it
suffices to show that, for given $f \in C(\Delta)$, the set
$\{T_{\mathcal{M}}^n f\}$ is relatively compact. To this end, we will
construct a compact set $K_f \subset C(\Delta)$ for which
$\{T_{\mathcal{M}}^n f\} \subset K_f$.

Let $h \in \Delta$ be some fixed probability distribution on
$\mathcal{X}$ for which $h_i > 0$ for all $i \in \mathcal{X}$. Let
$\mathcal{V} = \{V = (v_{ij}) : v_{ij} \ge 0,\ hVe = 1\}$ be the
$s \times s$ matrices with nonnegative entries obeying the linear
constraint $hVe = 1$.

Given $f$, we define a continuous function
$g(\alpha, V) = \alpha Ve\, f(\alpha V/\alpha Ve)$. The function is
defined in the first instance for those
$(\alpha, V) \in \Delta \times \mathcal{V}$ for which
$\alpha Ve \ne 0$, and because $f$ is bounded, $g$ has a unique
continuous extension to all of $\Delta \times \mathcal{V}$: take
$g(\alpha, V) = 0$ when $\alpha Ve = 0$. Let $K_f \subset C(\Delta)$ be
the set of all functions of $\alpha$ obtained by integrating
$g(\alpha, V)$ with respect to all probability measures on the compact
set $\mathcal{V}$, that is, all functions of form

$$\alpha \mapsto E\,g(\alpha, V)$$

for random elements $V$ in $\mathcal{V}$. Thus, $K_f$ is the closed
convex hull of the compact set of all the functions of $\alpha$ obtained
from $g(\alpha, V)$ by holding $V$ fixed. Hence, $K_f$ is also compact.

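For given $n$, pick a random element $\mathbf{w} \in \mathcal{Y}^n$ with
probability $hM(\mathbf{w})e$ and set

$$V_n = \frac{M(\mathbf{w})}{hM(\mathbf{w})e},$$

which, with probability 1, is a matrix in $\mathcal{V}$. Then, referring
to (2), we see that

$$(T_{\mathcal{M}}^n f)(\alpha) = E\,g(\alpha, V_n),$$

so $T_{\mathcal{M}}^n f \in K_f$. Thus, $\{T_{\mathcal{M}}^n f\}$ is
contained in $K_f$ and, hence, is equicontinuous. □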
4. Main result. We now embark on the proof of our main result. Recall
that $\mathcal{C}$ is the set of all positive scalar multiples of
matrices in $\mathcal{M}^*$ and that its closure is $\bar{\mathcal{C}}$.

Theorem 2. Let $M$ and $\mathcal{M}$ be given. Suppose $M$ is
irreducible and suppose that $\bar{\mathcal{C}}$ contains an element of
rank 1. Then $T_{\mathcal{M}}$ has a unique stationary distribution
$\lambda$, and $(T_{\mathcal{M}}^{*})^n\mu \to \lambda$ weakly in Cesàro
mean, for each $\mu \in \mathcal{P}(\Delta)$. Suppose, in addition, that
$M$ is aperiodic. Then $(T_{\mathcal{M}}^{*})^n\mu \to \lambda$ weakly,
for each $\mu \in \mathcal{P}(\Delta)$.

Proof. We will use the rank 1 element of $\bar{\mathcal{C}}$ to
construct a topologically reachable point $v \in \Delta$. If $M$ is
aperiodic, $v$ will also be topologically aperiodic. According to
Theorem 18.4.4 of [11], page 460, the existence of such $v$, the fact
that we are working with an e-chain, and the compactness of $\Delta$
together imply the stated results.

Suppose $R \in \bar{\mathcal{C}}$ has rank 1, so $R = uv$, where
$u \ne 0$ is a nonnegative column vector and $v$ a nonnegative row
vector which we may assume scaled so $ve = 1$. In particular,
$v \in \Delta$. We will show that if $M$ is irreducible, then $v$ is
topologically reachable, that is, for each $\alpha \in \Delta$ and each
open set $O$ containing $v$, there exists a $k > 0$ such that
$P^k(\alpha, O) > 0$.

For each $\alpha \in \Delta$, there is some word $M(\mathbf{z})$ such
that $\alpha M(\mathbf{z})u > 0$, as follows. There are certainly $i$
and $j$ with $\alpha_i > 0$, $u_j > 0$, and since for some $k$, we have
$\sum_{\mathbf{z} \in \mathcal{Y}^k} M(\mathbf{z})_{ij} = (M^k)_{ij} > 0$,
we must have $\alpha M(\mathbf{z})u > 0$ for some
$\mathbf{z} \in \mathcal{Y}^k$.

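As a consequence, $\alpha M(\mathbf{z})R$ is a nonzero multiple of $v$.
But $R$ is a limit of rescaled words:

$$R = \lim_{n\to\infty} M(\mathbf{y}_n)/s_n$$

for some sequence of words $\mathbf{y}_n$ and reals $s_n > 0$. For all
$n$ sufficiently large, $\alpha M(\mathbf{z})M(\mathbf{y}_n)e > 0$, so

$$v = \frac{\alpha M(\mathbf{z})R}{\alpha M(\mathbf{z})Re}
= \lim_{n\to\infty}
\frac{\alpha M(\mathbf{z})M(\mathbf{y}_n)}{\alpha M(\mathbf{z})M(\mathbf{y}_n)e}.$$

This implies that, for any neighborhood $O$ of $v$, for $n$ large
enough, we must also have

$$\frac{\alpha M(\mathbf{z})M(\mathbf{y}_n)}{\alpha M(\mathbf{z})M(\mathbf{y}_n)e} \in O.$$

Hence, referring to (1),
$P^k(\alpha, O) \ge \alpha M(\mathbf{z})M(\mathbf{y}_n)e > 0$ for
$k = |\mathbf{z}| + |\mathbf{y}_n|$, and so $v$ is topologically
reachable.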

If $M$ is also aperiodic, then a strengthening of this argument yields a
positive lower bound on $P^k(\alpha, O)$ which is uniform in large $k$,
showing that $v$ is a topologically aperiodic state. Let
$\mathbf{y}_n$, $u$, $v$ and $R = uv$ be as above, and let $\pi$ be the
stationary probability vector for $M$. Pick $\mathbf{z}$ so
$\pi M(\mathbf{z})u > 0$, let
$w = M(\mathbf{z})u/\pi M(\mathbf{z})u$, and define
$S_n = M(\mathbf{z})M(\mathbf{y}_n)$. Then
$\lim_{n\to\infty} S_n/\pi S_n e = wv$.

Let $\|\cdot\|$ denote the $\ell_1$ norm for row vectors and the induced
operator norm for matrices acting on row vectors on the right, so for
row vector $\alpha$ and matrix $T$ we have
$|\alpha Te| \le \|\alpha T\| \le \|\alpha\|\,\|T\|$. Let $B$ be the
closed unit $\ell_1$ ball in $\mathbb{R}^s$. Then there exist matrices
$T_n$ and scalars $\delta_n \ge 0$ so that

$$\frac{S_n}{\pi S_n e} = wv + \delta_n T_n,$$

with $\|T_n\| \le 1$ and $\lim_{n\to\infty} \delta_n = 0$. Now pick $n$
so large that $\pi S_n e > 0$ and that $\delta_n$ is sufficiently small
that both $\delta_n < 1/4$ and, for all $\beta \in B$,

$$\frac{v + \sqrt{\delta_n}\,\beta}{1 + \sqrt{\delta_n}\,\beta e} \in O.$$

Finally, let $t = 2\sqrt{\delta_n}\,\pi S_n e$ and let
$m = |\mathbf{z}| + |\mathbf{y}_n|$.

Given all these choices, we claim that, for all $\alpha \in \Delta$,

$$P^{k+m}(\alpha, O) \ge \alpha M^k S_n e - t \tag{3}$$

for all $k \ge 0$. If so, since $M$ is aperiodic, $\alpha M^k \to \pi$
as $k \to \infty$, so

$$\liminf_{k\to\infty} P^{k+m}(\alpha, O) \ge \pi S_n e - t
= (1 - 2\sqrt{\delta_n})\,\pi S_n e > 0.$$

Letting $\alpha = v$ shows, in particular, that $v$ is a topologically
aperiodic state.

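To prove (3), first assume $k = 0$. Since (3) is then trivially true if
$\alpha S_n e \le t$, we may assume $\alpha S_n e > t$. But in that case
$S_n$ steps $\alpha$ into $O$ as follows. Since

$$\alpha w + \delta_n \alpha T_n e = \frac{\alpha S_n e}{\pi S_n e}
> 2\sqrt{\delta_n},$$

we get a lower bound on the scalar $\alpha w$:

$$\alpha w > 2\sqrt{\delta_n} - \delta_n \alpha T_n e
\ge 2\sqrt{\delta_n} - \delta_n > \sqrt{\delta_n}.$$

Let $\beta = \sqrt{\delta_n}\,\alpha T_n/\alpha w$, so $\beta \in B$.
Then

$$\frac{\alpha S_n}{\alpha S_n e}
= \frac{\alpha S_n/\pi S_n e}{\alpha S_n e/\pi S_n e}
= \frac{\alpha w v + \delta_n \alpha T_n}{\alpha w + \delta_n \alpha T_n e}
= \frac{v + \sqrt{\delta_n}\,\beta}{1 + \sqrt{\delta_n}\,\beta e} \in O.$$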
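Hence, the word $S_n \in \mathcal{M}^*$, of length $m$, steps $\alpha$
into $O$. This implies $P^m(\alpha, O) \ge \alpha S_n e$, verifying (3)
when $k = 0$.
For $k > 0$, we have

$$P^{k+m}(\alpha, O) = \sum_{|\mathbf{w}| = k} \alpha M(\mathbf{w})e\,
P^m\!\left(\frac{\alpha M(\mathbf{w})}{\alpha M(\mathbf{w})e},\, O\right)$$

$$\ge \sum_{|\mathbf{w}| = k} \alpha M(\mathbf{w})e
\left(\frac{\alpha M(\mathbf{w})S_n e}{\alpha M(\mathbf{w})e} - t\right)$$

$$= \sum_{|\mathbf{w}| = k} \alpha M(\mathbf{w})S_n e
- \sum_{|\mathbf{w}| = k} \alpha M(\mathbf{w})e\,t$$

$$= \alpha M^k S_n e - t,$$

concluding the verification of (3).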

By Theorem 1, $\{\alpha_n\}$ is an e-chain; it is obviously bounded in
probability on average in the sense of [11], page 285, since $\Delta$ is
compact.

Hence, Theorem 18.4.4 of [11], page 460, applies, and our theorem follows. 

□ 

5. Kaijser's result. We are now in a position to derive Kaijser's theorem 
from our Theorem 2. 

Theorem A ([9]). Suppose $M$ is aperiodic. If there is a nonvanishing
subrectangular $M(\mathbf{y}) \in \mathcal{M}^*$, then there exists a
unique $T_{\mathcal{M}}$-invariant probability measure $\lambda$, and
for all $\mu \in \mathcal{P}(\Delta)$, we have
$(T_{\mathcal{M}}^{*})^n\mu \to \lambda$ (weakly) as $n \to \infty$.

Proof. First, we find a nonvanishing subrectangular word
$G = M(\mathbf{z})$ with a positive entry in its $(1, 1)$ position. If
$M(\mathbf{y})$ does not already have this property, we pick $(i, j)$
such that $M(\mathbf{y})_{ij} > 0$, and then, as in the proof of
Theorem 2, find words $M(\mathbf{u})$ and $M(\mathbf{v})$ such that
$M(\mathbf{u})_{1i} > 0$ and $M(\mathbf{v})_{j1} > 0$. Let
$\mathbf{z} = \mathbf{u}\mathbf{y}\mathbf{v}$. The product of a
subrectangular matrix and a nonnegative matrix is subrectangular, so
$M(\mathbf{z}) = M(\mathbf{u})M(\mathbf{y})M(\mathbf{v})$ has the
desired property.

Let $R$ and $C$ be the sets of row and column indices which specify
where $G_{ij} > 0$. That is, $G_{ij} > 0$ if and only if $i \in R$ and
$j \in C$. For notational convenience, pretend that
$R = S_{I} \cup S_{II}$ and $C = S_{I} \cup S_{III}$, where $S_{I}$,
$S_{II}$, $S_{III}$ and $S_{IV}$ are a partition of $\mathcal{X}$ into
blocks of consecutive integers. That is to say, $G$ has block structure

$$G = \begin{pmatrix}
A & 0 & B & 0 \\
C & 0 & D & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix},$$

where all of the entries in blocks $A$, $B$, $C$ and $D$ are strictly
positive, and the blocks on the diagonal are square. In particular,
since $G_{11} > 0$, the upper left block $A$ is $k \times k$, for some
$k \ge 1$.
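Check by induction that, for $n \ge 2$,

$$G^n = \begin{pmatrix} A \\ C \\ 0 \\ 0 \end{pmatrix}
A^{n-2} \begin{pmatrix} A & 0 & B & 0 \end{pmatrix}.$$

By the Perron theorem [7], page 502, a suitably rescaled version of
$A^n$ has a limit:

$$\lim_{n\to\infty} \rho^n A^n = A_\infty,$$

where $\rho > 0$ is the reciprocal of the spectral radius of $A$, all
elements of $A_\infty$ are strictly positive, and $A_\infty$ has rank 1.

Hence, for some sequence of scaling constants $s_n$,

$$\lim_{n\to\infty} G^n/s_n = \begin{pmatrix} A \\ C \\ 0 \\ 0 \end{pmatrix}
A_\infty \begin{pmatrix} A & 0 & B & 0 \end{pmatrix}$$

also has rank 1. Since each $G^n \in \mathcal{M}^*$, we have exhibited a
rank 1 element of $\bar{\mathcal{C}}$ and the result then follows from
Theorem 2. □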
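The Perron step in this proof can be observed numerically. The sketch below takes a hypothetical $4 \times 4$ subrectangular matrix `G` (rows $\{0,1,2\}$, columns $\{0,1,3\}$, so the block $A$ is $2 \times 2$; the entries are made up) and measures how far the normalized powers are from rank 1 by the largest $2 \times 2$ minor, which vanishes exactly when the rank is at most 1. The helper names are assumptions for the example, and floating point is used, so the result is only approximate.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize(m):
    """Divide a nonnegative matrix by the sum of its entries."""
    total = sum(sum(row) for row in m)
    return [[x / total for x in row] for row in m]

def normalized_power(g, n):
    """G^n rescaled to have entries summing to 1 (renormalized at every step)."""
    out = normalize(g)
    for _ in range(n - 1):
        out = normalize(mat_mul(out, g))
    return out

def rank_one_defect(m):
    """Largest absolute 2x2 minor of a square matrix; zero iff rank <= 1."""
    idx = range(len(m))
    return max(abs(m[i][j] * m[k][l] - m[i][l] * m[k][j])
               for i in idx for k in idx for j in idx for l in idx)

# Hypothetical subrectangular matrix: positive exactly on rows {0,1,2} x columns {0,1,3}.
G = [
    [0.3, 0.1, 0.0, 0.2],
    [0.1, 0.4, 0.0, 0.1],
    [0.2, 0.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]

early = rank_one_defect(normalized_power(G, 2))
late = rank_one_defect(normalized_power(G, 60))
print(early, late)
```

The defect decays geometrically, at a rate governed by the ratio of the two eigenvalues of the upper left block, which is the contractivity supplied by the Perron theorem.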



6. Three examples and a question. First, we give an example showing
that the assumption of aperiodicity, by itself, is not enough to
guarantee unique ergodicity of the $\alpha$-chain. The matrices

$$M(0) = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix},\qquad
M(1) = \begin{pmatrix} 0 & 1/2 \\ 1/2 & 0 \end{pmatrix}$$

give rise to an $\alpha$-chain with the following simple description:
$\alpha_n = (u, v) \in \Delta \subset \mathbb{R}^2$ moves with
probability 1/2 to $\alpha_{n+1} = (u, v)$ and with probability 1/2 to
$(v, u)$. Thus, $|u - v|$ is a nontrivial invariant for the
$\alpha$-chain, which therefore has multiple stationary distributions.
Examples of such include the following: the uniform distribution on
$\Delta$, the point mass at $(1/2, 1/2)$ and, for each
$0 \le u < 1/2$, the measures assigning probability 1/2 to each of
$(u, 1 - u)$ and $(1 - u, u)$. (A similar example appears in [9].)
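The non-uniqueness in this first example can be confirmed with exact arithmetic. The sketch below assumes the stepping matrices are $M(0) = \tfrac12 I$ and $M(1) = \tfrac12$ times the swap matrix, which is the pair matching the description of the moves $(u,v) \to (u,v)$ and $(u,v) \to (v,u)$; the helper names `successors` and `push_forward` are hypothetical. It pushes two finitely supported measures through one step of the chain and checks that each is unchanged.

```python
from collections import defaultdict
from fractions import Fraction as F

H = F(1, 2)
# Assumed stepping matrices for the example: half the identity and half the swap.
STEPPING = {0: [[H, F(0)], [F(0), H]], 1: [[F(0), H], [H, F(0)]]}

def successors(alpha, stepping):
    """List of (probability, next state) pairs for one step from alpha."""
    out = []
    for m in stepping.values():
        unnorm = [sum(alpha[i] * m[i][j] for i in range(len(alpha)))
                  for j in range(len(m[0]))]
        weight = sum(unnorm)
        if weight > 0:
            out.append((weight, tuple(x / weight for x in unnorm)))
    return out

def push_forward(measure, stepping):
    """Distribution of alpha_{n+1} when alpha_n has the finitely supported law `measure`."""
    new = defaultdict(F)
    for alpha, mass in measure.items():
        for weight, nxt in successors(alpha, stepping):
            new[nxt] += mass * weight
    return dict(new)

point_mass = {(H, H): F(1)}
two_point = {(F(1, 4), F(3, 4)): H, (F(3, 4), F(1, 4)): H}
print(push_forward(point_mass, STEPPING) == point_mass)
print(push_forward(two_point, STEPPING) == two_point)
```

Both measures are invariant, so the chain has more than one stationary distribution even though $M$ has all entries equal to 1/2.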

Next, we give an example showing that the assumption of aperiodicity
cannot be replaced by irreducibility in Kaijser's result. The matrices

$$M(0) = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},\qquad
M(1) = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}$$

specify an HMM satisfying the subrectangularity condition. $M$ is
clearly irreducible but not aperiodic. If the starting measure $\mu$
puts mass 1 at $\alpha_0 = (x, 1 - x)$, where
$x \notin \{0, 1/2, 1\}$, the subsequence $\alpha_{2n}$ has one limit
distribution [which puts mass $x$ at $(1, 0)$ and mass $1 - x$ at
$(0, 1)$] and the subsequence $\alpha_{2n+1}$ has a different limit
distribution [which puts mass $x$ at $(0, 1)$ and mass $1 - x$ at
$(1, 0)$].
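The oscillation in this second example can be traced step by step. The sketch below assumes the two stepping matrices are the single-entry matrices with a 1 in positions $(1,2)$ and $(2,1)$ respectively, which is the only decomposition of the swap matrix consistent with the limit distributions just described; the helper names are hypothetical and $x = 1/3$ is an arbitrary choice.

```python
from collections import defaultdict
from fractions import Fraction as F

# Assumed stepping matrices: each has a single nonzero entry, so each is subrectangular.
STEPPING = {0: [[F(0), F(1)], [F(0), F(0)]], 1: [[F(0), F(0)], [F(1), F(0)]]}

def push_forward(measure, stepping):
    """One step of the alpha-chain applied to a finitely supported law on the simplex."""
    new = defaultdict(F)
    for alpha, mass in measure.items():
        for m in stepping.values():
            unnorm = [sum(alpha[i] * m[i][j] for i in range(len(alpha)))
                      for j in range(len(m[0]))]
            weight = sum(unnorm)
            if weight > 0:  # outputs of probability zero are skipped
                new[tuple(x / weight for x in unnorm)] += mass * weight
    return dict(new)

x = F(1, 3)
laws = [{(x, 1 - x): F(1)}]
for _ in range(4):
    laws.append(push_forward(laws[-1], STEPPING))

for n, law in enumerate(laws):
    print(n, law)
```

From time 1 onward the law alternates between two distinct two-point measures on the vertices, so the distributions of $\alpha_n$ do not converge.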

Finally, at the end of his paper Kaijser conjectures that if
$p \ne q$, the $\alpha$-chain for the HMM with two stepping matrices



$$M(0) = \begin{pmatrix}
p & 0 & 0 & 0 \\
0 & 1/2 & 0 & 0 \\
1/2 & 0 & 0 & 0 \\
0 & 1/2 & 0 & 0
\end{pmatrix}$$

and M(l) 



/0 q \ 

1/2 

1/2 

V0 1/2 / 



has a unique stationary distribution, even though there are no nonzero sub- 
rectangular words. This conjecture is true, as we now show. 

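Applying the method used in our proof of Theorem A, consider the
rescaled limits of $M(0)^n$. It is easy to check that

$$M(0)^n = \begin{pmatrix}
p^n & 0 & 0 & 0 \\
0 & 1/2^n & 0 & 0 \\
p^n/2p & 0 & 0 & 0 \\
0 & 1/2^n & 0 & 0
\end{pmatrix}.$$

So if $p > 1/2$,

$$\lim_{n\to\infty} \frac{M(0)^n}{e'M(0)^n e} = \frac{2p}{2p+1}
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
1/2p & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix},$$

and if $p < 1/2$,

$$\lim_{n\to\infty} \frac{M(0)^n}{e'M(0)^n e} =
\begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 1/2 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 1/2 & 0 & 0
\end{pmatrix}.$$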


In either case the limit has rank 1, so, by Theorem 2, the
$\alpha$-chain has a unique invariant measure.
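These rescaled limits can be checked with exact arithmetic. The sketch below assumes that $M(0)$ is the $4 \times 4$ matrix with rows $(p,0,0,0)$, $(0,1/2,0,0)$, $(1/2,0,0,0)$, $(0,1/2,0,0)$, a reading chosen to be consistent with the stated formula for $M(0)^n$; the function names and the test values $p = 7/10$ and $p = 3/10$ are assumptions made for the example.

```python
from fractions import Fraction as F

def m0(p):
    """Assumed form of M(0), consistent with the displayed powers M(0)^n."""
    h = F(1, 2)
    return [[p, 0, 0, 0], [0, h, 0, 0], [h, 0, 0, 0], [0, h, 0, 0]]

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalized_power(m, n):
    """M^n divided by the sum of its entries, that is, by e' M^n e."""
    out = m
    for _ in range(n - 1):
        out = mat_mul(out, m)
    total = sum(sum(row) for row in out)
    return [[x / total for x in row] for row in out]

big = normalized_power(m0(F(7, 10)), 40)    # case p > 1/2
small = normalized_power(m0(F(3, 10)), 40)  # case p < 1/2

print(float(big[0][0]), float(big[2][0]))      # near 2p/(2p+1) and 1/(2p+1)
print(float(small[1][1]), float(small[3][1]))  # both near 1/2
```

In both regimes the mass concentrates on a single column, so the limit has rank 1; at $p = 1/2$ the two columns stay balanced and this particular sequence of words gives no rank 1 limit.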

Thus, Kaijser's subrectangularity condition is sufficient but not necessary. 

In light of our proof, as well as this example, one may ask the following 
question: Is the condition that $\bar{\mathcal{C}}$ contains a rank 1
matrix a necessary and sufficient condition for the $\alpha$-chain to
have a unique invariant measure, when $M$ is irreducible?